This repository was archived by the owner on Oct 27, 2025. It is now read-only.

Export documents and assets (WHIT-2419)#3311

Merged
ChrisBAshton merged 16 commits into main from export-news-articles
Sep 10, 2025
Conversation

Contributor

@ChrisBAshton ChrisBAshton commented Aug 29, 2025

What

This PR defines two rake tasks that can be used to export content out of Content Publisher:

  1. `export:live_document_and_assets`, which exports a single document, given its content ID. By default it outputs to STDOUT but will write to a JSON file if a path is provided.
  2. `export:live_documents_and_assets`, which exports all of the live documents and assets. By default it outputs to STDOUT but will write a separate JSON file for each exported document if an output directory is provided.

Usage (respectively):

# single document
rake export:live_document_and_assets["4c2ff601-4494-458b-bdc3-7d358122c2ae","./chris/foo.json"]

# all the documents
rake export:live_documents_and_assets["./chris"]
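The STDOUT-or-file behaviour shared by both tasks can be sketched roughly like this (`emit_export` is an illustrative name, not the actual code in this PR):

```ruby
require "json"

# Hedged sketch of the output behaviour described above: serialise a document
# as JSON, write to the given path when one is supplied, print to STDOUT
# otherwise. The real rake tasks wrap logic like this per document.
def emit_export(document_hash, output_path = nil)
  json = JSON.pretty_generate(document_hash)
  output_path ? File.write(output_path, json) : puts(json)
  json
end
```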

The latter rake task is what we imagine we'll use alongside an import script in Whitehall, to be built in https://gov-uk.atlassian.net/browse/WHIT-2440. Sister PR: alphagov/whitehall#10635

Why

We want to retire Content Publisher. Whilst we could 'withdraw' all content to avoid having to migrate it, we'd then need some way of hacking an update directly into Publishing API in an emergency, and we figured that would ultimately just complicate the publishing landscape even further. Better to export the content out of Content Publisher and back into Whitehall, where it can be managed centrally.

On investigation, there are only around 3,000 documents to export, of two different content types, and they're all low-value pieces (news articles that have served their purpose already and, as stated above, have already been cleared for mass withdrawal by our Content colleagues, provided we can be sure we can apply edits retrospectively). For that reason, a thorough reproduction of every bit of document history is not necessary here: we want to lift and shift the content as best we can without expending weeks' worth of effort. We're therefore only exporting the latest published edition for each document (no drafts), and are not bothering to transfer over the removed or unpublished-with-redirect items, since these would mean extra complexity for virtually no benefit.

JIRA: https://gov-uk.atlassian.net/browse/WHIT-2419


⚠️ This repo is Continuously Deployed: make sure you follow the guidance ⚠️

Follow these steps if you are doing a Rails upgrade.

@ChrisBAshton ChrisBAshton force-pushed the export-news-articles branch 2 times, most recently from 528572f to 08c347c on August 29, 2025 13:36
@ChrisBAshton ChrisBAshton changed the title from "WIP: export rake task (WHIT-2419)" to "WIP: Export documents and assets (WHIT-2419)" on Aug 29, 2025
@ChrisBAshton ChrisBAshton force-pushed the export-news-articles branch 15 times, most recently from 536a1d1 to 2699dab on September 1, 2025 09:15
@GDSNewt GDSNewt force-pushed the export-news-articles branch from b5cb257 to 55870ef on September 1, 2025 13:16
@ChrisBAshton ChrisBAshton force-pushed the export-news-articles branch 11 times, most recently from fc2c1f8 to 2bf11a8 on September 2, 2025 13:05
Content Publisher has two means of designating an edition to be
'political':

1. 'system political': if associated with role appointments or
   political organisations
2. 'editor political': if the publisher has explicitly marked the
   document as political via the "Gets history mode" checkbox in
   Content Publisher's UI.

For the purposes of importing these docs into Whitehall, we don't
particularly care about the _means_ by which a document is considered
political. We don't need the overhead of mapping one 'flavour'
of political to Whitehall's representation of that 'flavour'. It
is enough to pass the derived boolean from Content Publisher
into Whitehall (which has its own 'political' boolean override
equivalent: https://github.com/alphagov/whitehall/blob/f5afce65514b02767df8625ccfa40c681046c95e/db/schema.rb#L444).
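Collapsing the two signals into one boolean could be sketched like this (the attribute and association names are illustrative, not Content Publisher's actual schema):

```ruby
# Hedged sketch: the editor's explicit setting (the "Gets history mode"
# checkbox) overrides the system-derived value; otherwise a document is
# political if it has political associations. All names are illustrative.
def political?(edition)
  return edition.editor_political unless edition.editor_political.nil?

  edition.tagged_role_appointments.any? || edition.tagged_political_organisations.any?
end
```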
Whitehall stores change notes and editorial remarks _per edition_,
yet we're only planning to migrate the latest (published) edition
out of Content Publisher, for simplicity. So we imagine this
property will be used to form one large 'editorial remark' in
Whitehall (and potentially as one large entry in the public
changenote history too).

The TimelineEntry class is doing the heavy lifting here. I've
generally followed the same logic as _content_publisher_entry.html.erb
in deciding which properties to expose.

The way we're testing this behaviour (by layering edition upon
edition in a unit test) doesn't appear to have been done before,
as we were getting an Uninitialized Constant error on
`RevisionUpdater::Image`. Adding a `require_relative` to the
`RevisionUpdater` class worked around that.
Contributor

@lauraghiorghisor-tw lauraghiorghisor-tw left a comment


Looks okay to me. A few suggestions around redundant (or still-needed) data in comments, and plenty of musings that might be useful for the import work.

FYI - There are quite a few TODOs in the commit messages, which are actually still to do.

Having done some import work with Tony before, I would suggest going through the exercise of creating both document types in WH from the console (including image and file attachment) to better understand what you actually need and address those TODOs. That is, before merging this in. Otherwise I bet this will be followed up by a few other PRs 😅 #burned_before

Whilst it's nice and clean to keep the export WH-agnostic, I'd probably go one step further and do the required mappings here already (like CP state to WH state, and other specific fields we'd want there). What we learned from the other import work was that if you keep the export mappings un-opinionated, you just end up having to test both ends; it wasn't worth it, as this is actually an export for WH anyway.
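For illustration, the sort of opinionated export-side mapping being suggested might look like this (the state names on both sides are assumptions, not the confirmed CP or WH state machines):

```ruby
# Hedged sketch of a CP-state -> WH-state mapping baked into the export.
# Failing loudly on unmapped states keeps the export honest about its scope.
CP_TO_WHITEHALL_STATE = {
  "published"              => "published",
  "published_but_needs_2i" => "published",
  "withdrawn"              => "withdrawn",
}.freeze

def whitehall_state_for(cp_state)
  CP_TO_WHITEHALL_STATE.fetch(cp_state) do
    raise ArgumentError, "no Whitehall mapping for Content Publisher state #{cp_state.inspect}"
  end
end
```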

Whilst digging through assets I realised that plenty of those which are not strictly on the live edition are still live. Perhaps moving everything in WH takes us one step closer to being able to do a data cleanup in AM one day.

@ChrisBAshton ChrisBAshton force-pushed the export-news-articles branch 2 times, most recently from 00420e8 to 4943459 on September 4, 2025 16:56
@ChrisBAshton ChrisBAshton force-pushed the export-news-articles branch 2 times, most recently from 72e617b to ac61340 on September 8, 2025 08:03
ChrisBAshton added a commit to alphagov/whitehall that referenced this pull request Sep 9, 2025
This commit replaces `all_asset_variants_uploaded?` with a simpler
`asset_uploaded?` method for both images and attachments. Instead of
requiring all variant files (e.g. `s960`, `s300`, etc.) to be present
before allowing publishing or previewing, we now consider the asset
“ready” if the original file has successfully uploaded.

In practice, all variants are generated and uploaded within milliseconds
of the original upload, and failures nearly always stem from the original
file itself. For example, if Asset Manager has an outage or detects a virus
in the uploaded asset, then the original variant will fail to upload, and
so will the other variants. If the original variant has successfully
uploaded, then we can be highly confident (though can't _guarantee_) that
its variants will have uploaded too. That hypothetical edge case carries
minimal risk, which we can live with in exchange for the following benefits:

* Supports content with partial variant sets (we will be importing news
  articles from Content Publisher, which have only the `s960` and `s300`
  variants).
* Enables future support for dynamically generated variants (e.g. via
  Fastly image resizing) without requiring full variant pre-generation.
* Greatly simplifies the codebase and removes brittle assumptions.

See alphagov/content-publisher#3311 (comment)
for a discussion on the different variant sizes and where they're
used.

As you can see in this commit, there is a lot of room for improvement
and consolidation across Whitehall. We have several implementations
of `asset_uploaded?` that are identical across modules. In practice,
most of these implementations were already only checking for the
'original' variant, and have now been simplified to make that
absolutely explicit. We should revisit this commit later to
consider how to extract some of this behaviour out into a shared
module so that we're not duplicating identical logic in multiple
places.
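The before/after described in that commit can be sketched roughly like this (the asset shape here is illustrative; Whitehall's real models differ):

```ruby
# Hedged sketch of the simplification: previously every variant had to be
# uploaded before publishing/previewing; now only the 'original' variant
# gates readiness. Each asset is represented as a small hash for brevity.
def all_asset_variants_uploaded?(assets)
  assets.any? && assets.all? { |a| a[:uploaded] }
end

def asset_uploaded?(assets)
  original = assets.find { |a| a[:variant] == "original" }
  !original.nil? && original[:uploaded]
end
```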
GDSNewt and others added 7 commits September 9, 2025 09:43
We expect that we'll drop the `high_resolution` variant at the
point of importing into Whitehall, but have kept it here for
completeness. The existing 960-wide and 300-wide variants match
some of the allowed sizes in Whitehall, so should be importable.
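Under that assumption, the variant mapping could be as simple as the following sketch (the Content Publisher variant names are assumptions, and the 's' prefix follows Whitehall's convention):

```ruby
# Hedged sketch: keep the width-based variants, prefixed 's' on the
# Whitehall side, and silently drop 'high_resolution'.
CP_VARIANT_TO_WHITEHALL = { "960" => "s960", "300" => "s300" }.freeze

def whitehall_variants(cp_variant_names)
  cp_variant_names.filter_map { |name| CP_VARIANT_TO_WHITEHALL[name] }
end
```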

We expect the exported image can be mapped to Whitehall's modelling
like so:

```
$ Image.last
=> #<Image:0x0000ffff63777d98
 id: 796660,
 image_data_id: 258369,
 edition_id: 1691184,
 alt_text: <ALT TEXT GOES HERE>,
 caption: <CAPTION GOES HERE>,
 created_at: "2025-06-22 17:38:04.000000000 +0100",
 updated_at: "2025-06-22 17:38:07.000000000 +0100">

$ Image.last.image_data
=> #<ImageData:0x0000ffff634668c0
 id: 258369,
 carrierwave_image: "ELECTRICITY_BILLS_TO_BE_SLASHED_FOR_BUSINESSES_1__...", # TODO: is there a problem here?
 created_at: "2025-06-22 17:38:04.000000000 +0100",
 updated_at: "2025-06-22 17:38:04.000000000 +0100",
 image_kind: "default">

$ Image.last.image_data.assets
=> [#<Asset:0x0000ffff6302e7a8
  id: 3930905,
  asset_manager_id: <DERIVE THE ASSET MANAGER ID FROM THE URL AND PUT THAT HERE>,
  variant: <VARIANT GOES HERE with 's' prefix, e.g. 's300'>,
  created_at: "2025-06-22 17:38:05.267635000 +0100",
  updated_at: "2025-06-22 17:38:05.267635000 +0100",
  assetable_type: "ImageData",
  assetable_id: 258369,
  filename: <DERIVE THE FILENAME FROM THE URL AND PUT THAT HERE>,
  <Asset:0x0000ffff63d33340
  ... # and so on

  TODO: will there be an issue with the missing s216/s465/s630 variants?
```
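The two 'DERIVE ... FROM THE URL' placeholders above could be filled in with something like this, assuming the usual Asset Manager media URL shape (`https://assets.publishing.service.gov.uk/media/<id>/<filename>`):

```ruby
require "uri"

# Hedged sketch: pull the asset_manager_id and filename out of an Asset
# Manager media URL. Raises if the URL doesn't match the assumed shape.
def asset_fields_from_url(url)
  parts = URI(url).path.split("/") # e.g. ["", "media", "<id>", "<filename>"]
  unless parts.length == 4 && parts[1] == "media"
    raise ArgumentError, "unexpected asset URL shape: #{url}"
  end

  { asset_manager_id: parts[2], filename: parts[3] }
end
```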
The original implementation exposed the following properties as well:

- isbn
- unique_reference
- paper_number
- parliamentary_session
- official_document_type

...but Content Publisher only has 54 file attachments in total, and
none of them define values for any of these properties - they are
all `nil`. So there is little point in exposing them as part of
the export.

We imagine the exported attachment can be fairly easily mapped into
Whitehall's attachment modelling:

```
whitehall(prod)> Attachment.find_by(attachment_data: 1343266)
=>
<FileAttachment:0x0000ffff766fd5d0
 id: 8823005,
 created_at: "2025-08-29 16:47:37.000000000 +0100",
 updated_at: "2025-08-29 16:47:37.000000000 +0100",
 title: <TITLE GOES HERE>
 accessible: false, # As per https://gds.slack.com/archives/C03D4A72HGE/p1755694877638679 - all non-HTML attachments should have this unchecked as per guidance.
 isbn: nil,
 unique_reference: nil,
 command_paper_number: nil,
 hoc_paper_number: nil,
 unnumbered_command_paper: nil,
 unnumbered_hoc_paper: nil,
 parliamentary_session: nil,
 type: "FileAttachment",
 ...>
```

...and:

```
AttachmentData.find(1343266)
=>
<AttachmentData:0x0000ffff7671a518
 id: 1343266,
 carrierwave_file: "sample.pdf", # TODO??
 content_type: "application/pdf", # TODO??
 file_size: 131514,  # TODO??
 number_of_pages: 6, # TODO??
 created_at: "2025-08-29 16:47:37.000000000 +0100",
 updated_at: "2025-08-29 16:47:37.000000000 +0100",
 replaced_by_id: nil>
```

Some things to be worked out about the AttachmentData - but I imagine
the import script will need to `curl` the attachment and figure out
its file size, number of pages and content type.
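As a rough illustration of deriving those fields from the downloaded bytes (the page count here is a crude heuristic that counts PDF page objects; a real import would be better served by a proper parser such as the pdf-reader gem):

```ruby
# Hedged sketch: derive AttachmentData-style fields from an attachment's
# raw bytes. Content-type sniffing and page counting are deliberately
# crude and for illustration only.
def attachment_metadata(bytes, filename)
  pdf = filename.end_with?(".pdf") && bytes.start_with?("%PDF")
  {
    file_size: bytes.bytesize,
    content_type: pdf ? "application/pdf" : "application/octet-stream",
    # Counts "/Type /Page" objects, excluding the "/Pages" tree node.
    number_of_pages: pdf ? bytes.scan(%r{/Type\s*/Page[^s]}).count : nil,
  }
end
```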
I've created two rake tasks:

1. `export:live_document_and_assets`, which exports a single
   document, given its content ID. By default it outputs to STDOUT
   but will write to a JSON file if a path is provided.
2. `export:live_documents_and_assets`, which exports all of the
   live documents and assets. By default it outputs to STDOUT
   but will write a separate JSON file for each exported document
   if an output directory is provided.

Usage (respectively):

```
rake export:live_document_and_assets["4c2ff601-4494-458b-bdc3-7d358122c2ae","./chris/foo.json"]
```

```
rake export:live_documents_and_assets["./chris"]
```
We're likely to use the former in Whitehall. We won't use the
latter (there is no DB field for it).
We need to be able to associate a news article with a government
ID to properly enable history mode.
The 'document_history' property wasn't exposing any public
changenotes. We'll revisit document_history in the next commit.

I did consider rolling my own change_notes implementation:

```
  def self.change_notes(document)
    change_notes = document.editions.map do |edition|
      change_note = edition.revision.metadata_revision.change_note
      if change_note.present?
        { change_note:, created_at: edition.created_at, created_by: User.find(edition.created_by_id).email }
      end
    end
    change_notes.compact
  end
```

...but on closer inspection, there is a PublishingApiPayload
`History` class that does exactly what we want and is already
fully tested:
https://github.com/alphagov/content-publisher/blob/ef1ee481f5309e6fd010967231ff5b4723db60ca/spec/lib/publishing_api_payload/history_spec.rb#L1

It is enough for our tests to check that that class is called.

That class also handles the backdating of `first_published_at`,
so we're fixing that whilst we're here.
In the initial implementation of this, we'd wrongly assumed that
public changenotes were included in the document history displayed
to publishers in the UI. We've since defined a separate
`change_notes` method for capturing that. That new method also
includes the 'backdating' information, and we've updated
`first_published_at` to accommodate backdating too, so it's no
longer required to be surfaced here.

What is still needed is some means of exposing 'internal' change
history, i.e. internal notes / editorial remarks, but also the
option of carrying over update/publish dates of each revision and
the author of each.

So in this commit we've renamed the method to `internal_history`
to better describe the new scope, and have dropped any complexity
around backdating.
@ChrisBAshton
Contributor Author

Thanks @lauraghiorghisor-tw - I'm happy enough that the export gives us everything we need for alphagov/whitehall#10635 now. Merging 🎉

@ChrisBAshton ChrisBAshton merged commit 7a999c7 into main Sep 10, 2025
11 checks passed
@ChrisBAshton ChrisBAshton deleted the export-news-articles branch September 10, 2025 08:09